Skip to content

[fix](insert) fix InsertLoadJob memory leak caused by jobs permanently stuck in PENDING state#62282

Open
sollhui wants to merge 1 commit intoapache:masterfrom
sollhui:fix_job_leak
Open

[fix](insert) fix InsertLoadJob memory leak caused by jobs permanently stuck in PENDING state#62282
sollhui wants to merge 1 commit intoapache:masterfrom
sollhui:fix_job_leak

Conversation

@sollhui
Copy link
Copy Markdown
Contributor

@sollhui sollhui commented Apr 9, 2026

fix #62118

Problem

InsertLoadJob is registered with LoadManager for all insert types during executor
construction, but most executors never transition the job to a terminal state, causing
permanent PENDING leaks:

Type Success Failure Root Cause
External (Hive/Iceberg/MC/JDBC) PENDING PENDING afterExec never calls recordFinishedLoadJob
Blackhole / Dictionary PENDING PENDING afterExec is empty
OLAP (normal) FINISHED PENDING afterExec is skipped on failure — executeSingleInsert does early return in catch block before reaching afterExec
OLAP (group_commit TVF) PENDING PENDING Job registered but execution is BE-driven, FE never calls afterExec

Fix

1. Introduce needRegister flag in AbstractInsertExecutor

Add boolean needRegister to the constructor to make job registration an explicit
opt-in. A default 7-arg constructor sets needRegister = false, so any executor that
does not pass the flag will not register. Only OlapInsertExecutor and
RemoteOlapInsertExecutor pass needRegister = (jobId != -1), preserving existing
behavior for inline-table and group_commit TVF cases where jobId is already -1.

This replaces the implicit pattern where jobId = -1 was used as a side-channel to
skip registration, which was easy to miss when adding new executor types.

2. Fix OLAP insert failure leak

Extract job recording into recordLoadJob(UserIdentity) and call it from both:

  • afterExecerrMsg is empty → job transitions to FINISHED
  • onFailerrMsg is set → job transitions to CANCELLED

@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 9, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Apr 9, 2026

run buildall

morrySnow
morrySnow previously approved these changes Apr 9, 2026
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 9, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

PR approved by anyone and no changes requested.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉
Increment coverage report
Complete coverage report

@sollhui sollhui changed the title [fix](group commit) fix memory leak of pending InsertLoadJob in group commit http stream [fix](insert) fix InsertLoadJob memory leak caused by jobs permanently stuck in PENDING state Apr 10, 2026
@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Apr 10, 2026
@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Apr 10, 2026

run buildall

@liaoxin01
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 issue.

  1. Critical checkpoint: task goal and correctness
    Conclusion: Partially met. The PR fixes local OlapInsertExecutor failure cleanup and avoids registering some non-OLAP paths, but it does not fix the equivalent remote OLAP path, so InsertLoadJob leaks still remain for that execution path. No test covering remote OLAP success/failure registration behavior was added.

  2. Critical checkpoint: change size and focus
    Conclusion: Mostly focused and small. The constructor opt-in for registration is a reasonable narrowing, but the fix is not consistently applied to all functionally parallel OLAP insert paths.

  3. Critical checkpoint: concurrency
    Conclusion: No new concurrency or locking risk identified in the changed code.

  4. Critical checkpoint: lifecycle / static initialization
    Conclusion: No special lifecycle or static initialization issue identified.

  5. Critical checkpoint: configuration
    Conclusion: No configuration changes.

  6. Critical checkpoint: compatibility
    Conclusion: No FE/BE protocol or storage compatibility issue identified from this patch.

  7. Critical checkpoint: functionally parallel paths
    Conclusion: Not satisfied. RemoteOlapInsertExecutor remains inconsistent with OlapInsertExecutor; it still registers a local load job but does not finish that same local job on either success or failure.

  8. Critical checkpoint: special conditions / assumptions
    Conclusion: The new group_commit TVF exclusion is clearly commented and justified.

  9. Critical checkpoint: test coverage
    Conclusion: Insufficient for the bug being fixed. The patch needs coverage for remote OLAP insert success/failure so this leak does not regress again.

  10. Critical checkpoint: observability
    Conclusion: Existing logs are probably adequate for debugging this area; no additional observability requirement stands out.

  11. Critical checkpoint: transaction / persistence
    Conclusion: The modified code still relies on LoadManager.recordFinishedLoadJob(...) for final job persistence, but the remote path does not reach the correct local LoadManager instance, so persistence/completion is still incomplete there.

  12. Critical checkpoint: data writes / atomicity
    Conclusion: No new data atomicity issue identified beyond the load-job lifecycle leak.

  13. Critical checkpoint: FE/BE variable passing
    Conclusion: No new FE/BE variable propagation issue identified.

  14. Critical checkpoint: performance
    Conclusion: No material performance concern identified in the changed code.

  15. Critical checkpoint: other issues
    Conclusion: The remaining remote-path leak is the blocking issue.

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Apr 10, 2026

run buildall

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Apr 10, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 blocking issue.

  1. fe/fe-core/src/main/java/org/apache/doris/nereids/trees/plans/commands/insert/OlapInsertExecutor.java: this patch changes OlapInsertExecutor to require a new needRegister constructor parameter, but the other subclasses (OlapTxnInsertExecutor, OlapGroupCommitInsertExecutor, and RemoteOlapInsertExecutor) still call the removed 7-argument super(...). FE does not compile in this state, so the fix is not complete.

Critical checkpoint conclusions:

  • Goal and correctness: Partially met. The intended local OLAP failure-path fix is reasonable, but the patch does not build, so the task is not accomplished end-to-end.
  • Scope and clarity: Small and focused overall, but the constructor refactor is incomplete across parallel OLAP executor paths.
  • Concurrency: No new concurrency or lock-order issue observed in the touched code.
  • Lifecycle and static init: No special lifecycle or static-initialization issue observed.
  • Configuration: No new configuration items.
  • Compatibility: No storage-format or FE/BE protocol compatibility change identified in the touched code.
  • Parallel code paths: Not fully handled. Other OlapInsertExecutor subclasses still rely on the old constructor signature.
  • Special conditional checks: The new group_commit TVF skip condition has a clear rationale comment.
  • Test coverage: No regression or unit test accompanies this insert-job lifecycle fix, so the failure and leak paths remain untested.
  • Observability: Existing logging and error propagation are adequate for this change.
  • Transaction and persistence: The intended recordFinishedLoadJob usage follows existing persistence flow and would mark failed jobs as CANCELLED, but the compile break prevents end-to-end validation.
  • Data writes and atomicity: No new write-atomicity issue identified in the intended logic.
  • FE/BE variable passing: Not applicable.
  • Performance: No notable performance concern in the touched paths.
  • Other issues: The constructor break is the primary blocker.

long jobId) {
super(ctx, table, labelName, planner, insertCtx, emptyInsert, jobId);
long jobId, boolean needRegister) {
super(ctx, table, labelName, planner, insertCtx, emptyInsert, jobId, needRegister);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing OlapInsertExecutor to an 8-argument constructor breaks the subclasses that still call the old 7-argument super(...). In this patch, OlapTxnInsertExecutor (line 46), OlapGroupCommitInsertExecutor (line 69), and RemoteOlapInsertExecutor (line 81) still invoke the removed signature, so FE no longer compiles. Please either keep the old overload or update all subclasses in the same change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] HTTP StreamLoad with GroupCommit in 4.0.3 may cause memory leak.

5 participants